BacPrep: An Experimental Platform for Evaluating LLM-Based Bacalaureat Assessment
Dumitran, Adrian Marius, Dita, Radu
Access to quality preparation and feedback for the Romanian Bacalaureat exam is challenging, particularly for students in remote or underserved areas. This paper introduces BacPrep, an experimental online platform exploring the potential of Large Language Models (LLMs) for automated assessment, aiming to offer a free, accessible resource. Using official exam questions from the last five years, BacPrep employs one of Google's newest models, Gemini 2.0 Flash (released February 2025), guided by official grading schemes, to provide experimental feedback. Currently operational, its primary research function is collecting student solutions and LLM outputs. This focused dataset is vital for planned expert validation to rigorously evaluate the feasibility and accuracy of this cutting-edge LLM in the specific Bacalaureat context before reliable deployment.
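As a rough illustration of the scheme-guided grading call such a platform might make, the sketch below uses the google-generativeai Python SDK; the prompt wording, question text, and grading-scheme format are placeholders, not BacPrep's actual implementation.

```python
# Hypothetical sketch: grading a Bacalaureat answer with Gemini 2.0 Flash,
# guided by an official grading scheme. Prompt and data are placeholders.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumption: key supplied by the caller
model = genai.GenerativeModel("gemini-2.0-flash")

def grade_solution(question: str, grading_scheme: str, student_answer: str) -> str:
    """Ask the model to score an answer against the official scheme."""
    prompt = (
        "You are grading a Romanian Bacalaureat exam answer.\n"
        f"Question:\n{question}\n\n"
        f"Official grading scheme (point breakdown):\n{grading_scheme}\n\n"
        f"Student answer:\n{student_answer}\n\n"
        "Award points per scheme item, justify each award, and give a total."
    )
    response = model.generate_content(prompt)
    return response.text
```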
LLMs cannot spot math errors, even when allowed to peek into the solution
Srivatsa, KV Aditya, Maurya, Kaushal Kumar, Kochmar, Ekaterina
Large language models (LLMs) demonstrate remarkable performance on math word problems, yet they have been shown to struggle with meta-reasoning tasks such as identifying errors in student solutions. In this work, we investigate the challenge of locating the first error step in stepwise solutions using two error-reasoning datasets: VtG and PRM800K. Our experiments show that state-of-the-art LLMs struggle to locate the first error step in student solutions even when given access to the reference solution. To address this, we propose an approach that generates an intermediate corrected student solution, aligned more closely with the original student's solution, which helps improve performance.
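A minimal sketch of this two-stage idea, assuming a generic `call_llm` chat backend (a placeholder, not the authors' code): first elicit a minimally corrected solution that mirrors the student's steps, then take the first diverging step as the error.

```python
# Illustrative sketch: correct the student's stepwise solution while staying
# close to its structure, then locate the first step where the original
# diverges from that correction.
from typing import Callable

def first_error_step(student_steps: list[str],
                     reference_solution: str,
                     call_llm: Callable[[str], str]) -> int:
    # Stage 1: an intermediate corrected solution aligned with the student's
    # own wording and step structure.
    joined = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(student_steps))
    corrected = call_llm(
        "Minimally rewrite this solution so it is correct, keeping the same "
        "number of steps and phrasing where possible.\n"
        f"Reference solution:\n{reference_solution}\n\nStudent solution:\n{joined}"
    )
    corrected_steps = [line.split(":", 1)[1].strip()
                       for line in corrected.splitlines() if ":" in line]
    # Stage 2: the first step that differs is taken as the first error.
    for i, (orig, fixed) in enumerate(zip(student_steps, corrected_steps)):
        if orig.strip() != fixed:
            return i  # 0-indexed first error step
    return -1  # no error found
```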
Benchmarking Large Language Models on Homework Assessment in Circuit Analysis
Chen, Liangliang, Qin, Zhihao, Guo, Yiming, Rohde, Jacqueline, Zhang, Ying
Large language models (LLMs) have the potential to revolutionize various fields, including code development, robotics, finance, and education, due to their extensive prior knowledge and rapid advancements. This paper investigates how LLMs can be leveraged in engineering education. Specifically, we benchmark the capabilities of different LLMs, including GPT-3.5 Turbo, GPT-4o, and Llama 3 70B, in assessing homework for an undergraduate-level circuit analysis course. We have developed a novel dataset consisting of official reference solutions and real student solutions to problems from various topics in circuit analysis. To overcome the limitations of image recognition in current state-of-the-art LLMs, the solutions in the dataset are converted to LaTeX format. Using this dataset, we design a prompt template to evaluate student solutions on five metrics: completeness, method, final answer, arithmetic error, and units. The results show that GPT-4o and Llama 3 70B perform significantly better than GPT-3.5 Turbo across all five metrics, with GPT-4o and Llama 3 70B each having distinct advantages in different evaluation aspects. Additionally, we present insights into the limitations of current LLMs in several aspects of circuit analysis. Given the paramount importance of ensuring reliability in LLM-generated homework assessment to avoid misleading students, our results establish benchmarks and offer valuable insights for the development of a reliable, personalized tutor for circuit analysis -- a focus of our future work. Furthermore, the proposed evaluation methods can be generalized to a broader range of courses in engineering education.
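The sketch below illustrates what such a five-metric prompt template might look like for LaTeX-formatted solutions; the JSON output contract and the `call_llm` placeholder are assumptions for illustration, not the paper's exact template.

```python
# Sketch of a five-metric grading prompt over LaTeX solutions. `call_llm`
# stands in for GPT-4o or Llama 3 70B; the JSON contract is an assumption.
import json

METRICS = ["completeness", "method", "final_answer", "arithmetic_error", "units"]

def assess_homework(reference_latex: str, student_latex: str, call_llm) -> dict:
    prompt = (
        "Compare the student solution to the reference solution, both in LaTeX.\n"
        f"Reference:\n{reference_latex}\n\nStudent:\n{student_latex}\n\n"
        "Return a JSON object with boolean fields: " + ", ".join(METRICS) + "."
    )
    verdict = json.loads(call_llm(prompt))
    return {m: bool(verdict.get(m, False)) for m in METRICS}
```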
PyEvalAI: AI-assisted evaluation of Jupyter Notebooks for immediate personalized feedback
Wandel, Nils, Stotko, David, Schier, Alexander, Klein, Reinhard
Grading student assignments in STEM courses is a laborious and repetitive task for tutors, often requiring a week to assess an entire class. For students, this delay of feedback prevents iterating on incorrect solutions, hampers learning, and increases stress when exercise scores determine admission to the final exam. Recent advances in AI-assisted education, such as automated grading and tutoring systems, aim to address these challenges by providing immediate feedback and reducing grading workload. However, existing solutions often fall short due to privacy concerns, reliance on proprietary closed-source models, lack of support for combining Markdown, LaTeX and Python code, or excluding course tutors from the grading process. To overcome these limitations, we introduce PyEvalAI, an AI-assisted evaluation system, which automatically scores Jupyter notebooks using a combination of unit tests and a locally hosted language model to preserve privacy. Our approach is free, open-source, and ensures tutors maintain full control over the grading process. A case study demonstrates its effectiveness in improving feedback speed and grading efficiency for exercises in a university-level course on numerics.
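A simplified sketch of this hybrid scoring idea, assuming a `local_llm_score` placeholder for the locally hosted model and a caller-supplied registry of unit tests; only the nbformat usage reflects a real API.

```python
# Rough sketch: score code cells with unit tests, pass Markdown answers to a
# locally hosted model. `local_llm_score` and the test registry are placeholders.
import nbformat

def grade_notebook(path: str, unit_tests: dict, local_llm_score) -> float:
    nb = nbformat.read(path, as_version=4)
    namespace, score = {}, 0.0
    for cell in nb.cells:
        if cell.cell_type == "code":
            exec(cell.source, namespace)       # run the student's code cells
    for name, test in unit_tests.items():      # e.g. {"exercise_1": callable}
        try:
            test(namespace)                    # raises AssertionError on failure
            score += 1.0
        except Exception:
            pass
    # Free-text answers (Markdown cells) go to the local model for a 0..1 score.
    for cell in nb.cells:
        if cell.cell_type == "markdown":
            score += local_llm_score(cell.source)
    return score
```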
Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams
Caraeni, Adriana, Scarlatos, Alexander, Lan, Andrew
Recent advances in generative artificial intelligence (AI) have shown promise in accurately grading open-ended student responses. However, few prior works have explored grading handwritten responses due to a lack of data and the challenge of combining visual and textual information. In this work, we leverage state-of-the-art multi-modal AI models, in particular GPT-4o, to automatically grade handwritten responses to college-level math exams. Using real student responses to questions in a probability theory exam, we evaluate GPT-4o's alignment with ground-truth scores from human graders using various prompting techniques. We find that while providing rubrics improves alignment, the model's overall accuracy is still too low for real-world settings, showing there is significant room for growth in this task.
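A minimal sketch of rubric-conditioned multimodal grading with the OpenAI Python SDK; the rubric wording and requested score format are illustrative assumptions, not the study's exact prompts.

```python
# Sketch: send a handwritten-answer image plus a rubric to GPT-4o and ask for
# a score. Prompt text and score format are placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def grade_handwritten(image_path: str, question: str, rubric: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text":
                    f"Question:\n{question}\nRubric:\n{rubric}\n"
                    "Grade the handwritten answer in the image; report a score."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```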
Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors
Daheim, Nico, Macina, Jakub, Kapur, Manu, Gurevych, Iryna, Sachan, Mrinmaya
Large language models (LLMs) present an opportunity to scale high-quality personalized education to all. A promising approach toward this goal is to build dialog tutoring models that scaffold students' problem-solving. However, even though existing LLMs perform well at solving reasoning questions, they struggle to precisely detect students' errors and to tailor their feedback to these errors. Inspired by real-world teaching practice, where teachers identify student errors and customize their responses accordingly, we focus on verifying student solutions and show how grounding generation in such verification improves the overall quality of tutor responses. We collect a dataset of 1K stepwise math reasoning chains with the first error step annotated by teachers. We show empirically that finding the mistake in a student solution is challenging for current models. We propose and evaluate several verifiers for detecting these errors. Using both automatic and human evaluation, we show that the student solution verifiers steer the generation model toward highly targeted responses to student errors, which are more often correct and contain fewer hallucinations than existing baselines.
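The sketch below illustrates verification-grounded generation under these assumptions: a generic `call_llm` placeholder stands in for both the verifier and the tutor model, and the prompts are invented for illustration.

```python
# Sketch: a verifier proposes the first error step, and its verdict is
# injected into the tutor prompt so the response targets the actual mistake.
def verify(problem: str, student_steps: list[str], call_llm) -> str:
    joined = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(student_steps))
    return call_llm(
        f"Problem: {problem}\nStudent steps:\n{joined}\n"
        "Name the first incorrect step number and the mistake, or say 'correct'."
    )

def tutor_reply(problem: str, student_steps: list[str], call_llm) -> str:
    verdict = verify(problem, student_steps, call_llm)
    # Grounding the generator in the verifier's verdict targets the real error.
    return call_llm(
        f"Problem: {problem}\nVerifier finding: {verdict}\n"
        "Write one scaffolding tutor turn that addresses this specific error "
        "without revealing the full solution."
    )
```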
Estimating Difficulty Levels of Programming Problems with Pre-trained Model
Wang, Zhiyuan, Zhang, Wei, Wang, Jun
As the demand for programming skills grows across industry and academia, students often turn to Programming Online Judge (POJ) platforms for coding practice and competition. The difficulty level of each programming problem serves as an essential reference for guiding students' adaptive learning. However, current methods of determining difficulty levels either require extensive expert annotation or take a long time to accumulate enough student solutions per problem. To address this issue, we formulate the problem of automatically estimating the difficulty level of a programming problem given its textual description and an example solution in code. To tackle this problem, we propose coupling two pre-trained models, one for the text modality and one for the code modality, into a unified model. We built two POJ datasets for the task, and the results demonstrate the effectiveness of the proposed approach and the contributions of both modalities.
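One plausible reading of this coupling, sketched in PyTorch: encode the problem statement and the example solution with separate pre-trained encoders and classify their concatenated [CLS] embeddings. The model names and the fusion-by-concatenation head are assumptions, not necessarily the paper's architecture.

```python
# Illustrative two-encoder difficulty classifier; model names are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel

class DifficultyEstimator(nn.Module):
    def __init__(self, num_levels: int = 5):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained("bert-base-uncased")
        self.code_encoder = AutoModel.from_pretrained("microsoft/codebert-base")
        hidden = (self.text_encoder.config.hidden_size
                  + self.code_encoder.config.hidden_size)
        self.classifier = nn.Linear(hidden, num_levels)

    def forward(self, text_inputs: dict, code_inputs: dict) -> torch.Tensor:
        # [CLS] embedding from each modality, concatenated then classified.
        t = self.text_encoder(**text_inputs).last_hidden_state[:, 0]
        c = self.code_encoder(**code_inputs).last_hidden_state[:, 0]
        return self.classifier(torch.cat([t, c], dim=-1))
```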
MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems
Macina, Jakub, Daheim, Nico, Chowdhury, Sankalan Pal, Sinha, Tanmay, Kapur, Manu, Gurevych, Iryna, Sachan, Mrinmaya
While automatic dialogue tutors hold great potential for making education personalized and more accessible, research on such systems has been hampered by a lack of sufficiently large and high-quality datasets. Collecting such datasets remains challenging, as recording tutoring sessions raises privacy concerns and crowdsourcing leads to insufficient data quality. To address this, we propose a framework for generating such dialogues by pairing human teachers with a Large Language Model (LLM) prompted to represent common student errors. We describe how we use this framework to collect MathDial, a dataset of 3k one-to-one teacher-student tutoring dialogues grounded in multi-step math reasoning problems. While models like GPT-3 are good problem solvers, they fail at tutoring because they generate factually incorrect feedback or are prone to revealing solutions to students too early. To overcome this, we let teachers provide learning opportunities to students by guiding them with various scaffolding questions according to a taxonomy of teacher moves. We demonstrate that MathDial and its extensive annotations can be used to fine-tune models to be more effective tutors (and not just solvers). We confirm this through automatic and human evaluation, notably in an interactive setting that measures the trade-off between student solving success and telling solutions. The dataset is released publicly.
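A toy sketch of such a collection loop: an LLM role-plays a student with a given error profile while a human teacher supplies the tutoring turns. The prompt wording and the `call_llm`/`teacher_input` hooks are illustrative assumptions, not the authors' framework.

```python
# Sketch: pair a human teacher with an LLM-simulated student who tends to
# commit a specified error. All prompts and hooks are placeholders.
def student_turn(problem: str, error_profile: str,
                 dialogue: list[str], call_llm) -> str:
    history = "\n".join(dialogue)
    return call_llm(
        f"Role-play a student solving: {problem}\n"
        f"You tend to make this mistake: {error_profile}\n"
        f"Dialogue so far:\n{history}\nReply as the student, staying in character."
    )

def collect_dialogue(problem: str, error_profile: str, call_llm,
                     teacher_input=input, max_turns: int = 6) -> list[str]:
    dialogue: list[str] = []
    for _ in range(max_turns):
        dialogue.append("Student: " + student_turn(problem, error_profile,
                                                   dialogue, call_llm))
        dialogue.append("Teacher: " + teacher_input("Teacher turn: "))
    return dialogue
```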
Teaching UML Skills to Novice Programmers Using a Sample Solution Based Intelligent Tutoring System
Schramm, Joachim, Strickroth, Sven, Le, Nguyen-Thinh, Pinkwart, Niels (all Clausthal University of Technology)
Modeling skills are essential in the process of learning programming. Intelligent tutoring systems (ITSs) for modeling are typically hard to build due to the ill-definedness of most modeling tasks. This paper presents a system that can teach UML skills to novice programmers. The system is "simple and cheap" in the sense that it only requires an expert solution against which student solutions are compared, yet it is flexible enough to accommodate the degree of solution flexibility and variability characteristic of modeling tasks. An empirical evaluation via a controlled lab study showed that the system worked reliably and, while not producing significant learning gains compared to a control condition, revealed some promising results.
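A toy sketch of sample-solution-based comparison with tolerance for naming variability, using only the standard library; the data shapes, matching threshold, and feedback messages are invented for illustration.

```python
# Sketch: fuzzily map student class names onto the expert solution, then
# check associations against the expert model. All shapes are illustrative.
from difflib import SequenceMatcher

def fuzzy_match(a: str, b: str, threshold: float = 0.8) -> bool:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def compare_uml(expert: dict, student: dict) -> list[str]:
    """Each model: {"classes": [...], "associations": [(src, dst), ...]}."""
    feedback = []
    mapping = {}  # student class -> matched expert class
    for sc in student["classes"]:
        match = next((ec for ec in expert["classes"] if fuzzy_match(sc, ec)), None)
        if match:
            mapping[sc] = match
        else:
            feedback.append(f"Unexpected class: {sc}")
    for ec in expert["classes"]:
        if ec not in mapping.values():
            feedback.append(f"Missing class: {ec}")
    mapped = {(mapping.get(s), mapping.get(d)) for s, d in student["associations"]}
    for src, dst in expert["associations"]:
        if (src, dst) not in mapped:
            feedback.append(f"Missing association: {src} -> {dst}")
    return feedback
```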